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COMPONENT SELECTION AND SMOOTHING IN 
MULTIVARIATE NONPARAMETRIC REGRESSION 

By Yi Lin^ and Hao Helen Zhang^ 

University of Wisconsin-Madison and North Carolina State University 

We propose a new method for model selection and model fit- 
ting in multivariate nonparametric regression models, in the frame- 
work of smoothing spline ANOVA. The "COSSO" is a method of 
regularization with the penalty functional being the sum of compo- 
nent norms, instead of the squared norm employed in the traditional 
smoothing spline method. The COSSO provides a unified framework 
for several recent proposals for model selection in linear models and 
smoothing spline ANOVA models. Theoretical properties, such as 
the existence and the rate of convergence of the COSSO estimator, 
are studied. In the special ceise of a tensor product design with pe- 
riodic functions, a detailed analysis reveals that the COSSO does 
model selection by applying a novel soft thresholding type opera- 
tion to the function components. We give an equivalent formulation 
of the COSSO estimator which leads naturally to an iterative algo- 
rithm. We compare the COSSO with MARS, a popular method that 
builds functional ANOVA models, in simulations and real examples. 
The COSSO method can be extended to classification problems and 
we compare its performance with those of a number of machine learn- 
ing algorithms on real datasets. The COSSO gives very competitive 
performance in these studies. 

1. Introduction. Consider the multivariate nonparametric regression prob- 
lem yi = f{xi) + ei,i = 1, . . . ,n, where / is the regression function to be esti- 
mated, Xi = {x[^\ . . . are d-dimensional vectors of covariates and the e 
are independent noise variates with mean and variance o"^ . The estimator is 
judged in terms of prediction accuracy and interpretability. A popular model 
for high-dimensional problems is the smoothing spline analysis of variance 
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(SS-ANOVA) model [10, 19, 20]. In the SS-ANOVA we write 

(1) fix) = 6 + E fM'^) + E fjkix^'^^^'^) + • • • , 

j=i j<k 

where 6 is a constant, the /j's are the main effects, the /jfc's are the two-way 
interactions, and so on. The sequence is usually truncated somewhere to 
enhance interpretability. The identifiability of the terms in (1) is assured by 
side conditions through averaging operators. The SS-ANOVA generalizes the 
popular additive model and provides a general framework for nonparametric 
multivariate function estimation. 

The common approach to estimation in SS-ANOVA is penalized least 
squares, with the penalty being a sum of squared norms of the terms in (1). 
One important question in the application of SS-ANOVA is to determine 
which variables or which ANOVA components should be included in the 
model. Gu [9] proposed using cosine diagnostics as model checking tools 
after model fitting in Gaussian regression. Chen [3] studied interaction spline 
models via SS-ANOVA and developed a nonstandard test procedure for 
model selection. Yau, Kohn and Wood [21] presented a Bayesian method 
for variable selection in a nonparametric manner. Gunn and Kandola [11] 
proposed a sparse kernel approach in a closely related framework. Zhang 
et al. [23] proposed a likelihood basis pursuit approach to model selection 
and estimation in the SS-ANOVA for exponential families. They expanded 
each nonparametric component function in (1) as a linear combination of a 
large number of basis functions and applied an Li penalty to the coefficients 
of all the basis functions. The Li penalty gives a solution that is sparse in 
the coefficients. However, a separate model selection procedure has to be 
applied after model fitting, since sparsity in coefficients helps but does not 
guarantee sparsity in SS-ANOVA components. 

In this paper we introduce a new approach for model selection and es- 
timation in SS-ANOVA. This is a penalized least squares method with the 
penalty functional being the sum of component norms, rather than the sum 
of squared component norms. Our method will be referred to as the Compo- 
nent Selection and Smoothing Operator (COSSO). The general methodology 
is introduced in Section 2, where we also prove the existence of the COSSO 
estimate and give some rate of convergence results. A connection between the 
COSSO and the popular LASSO in linear regression is shown. It turns out 
that when the COSSO formulation is used in linear models, it reduces to the 
LASSO. On the other hand, the COSSO gives an alternative interpretation 
of the penalty term in the LASSO to being the Li norm of the coefficients: it 
is the sum of component norms. Thus the COSSO can be seen as a nontriv- 
ial extension of the LASSO in linear models to multivariate nonparametric 
models. In Section 3 we obtain an alternative formulation of the COSSO 
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that is more suitable for computation. In Section 4 we consider the special 
case of a tensor product design with periodic functions. A detailed analysis 
in this special case sheds light on the mechanism of the COSSO in terms of 
component selection in SS-ANOVA. In particular, we show in this case that 
the COSSO does model selection by applying a novel soft thresholding-type 
operation to the function components. In Section 5 we present a COSSO 
algorithm that is based on iterating between the smoothing spline method 
and the nonnegative garrote [1]. In Section 6 we consider the choice of the 
tuning parameter. Simulations are given in Section 7, where we compare 
the COSSO with the MARS procedure developed by Friedman [8], a pop- 
ular algorithm that builds functional ANOVA models. The COSSO can be 
naturally extended to perform classification tasks, and we also compare the 
performance of the COSSO with that of many machine learning methods on 
some benchmark datasets. These real examples are given in Section 8, and 
Section 9 contains a discussion. The proofs are given in the Appendix. 

2. The COSSO in smoothing spline ANOVA. 

2.1. The smoothing spline ANOVA. In the commonly used smoothing 
spline ANOVA model over X = [0, l]'^, it is assumed that f T, where !F 
is a reproducing kernel Hilbert space (RKHS) corresponding to the decom- 
position (1). Let be a function space of functions of x^^^ over [0, 1] such 
that = {1} © . Then the tensor product space of the H^^s is 

d d 
(2) (g)F^={l}©5]^^©5][^^'0^*^]©---. 

j=l j=l j<k 

Each functional component in the SS-ANOVA decomposition (1) lies in a 
subspace in the orthogonal decomposition (2) of <S>j=iH^. Typically only 
low-order interactions are considered in the SS-ANOVA model for inter- 
pretability and visualization. The popular additive model is a special case 
in which f{x^^\ . . . , x^'^)) = b + J2j=i fji^^^^)' ^i^^ fj G W. In this case the 
selection of functional components is equivalent to variable selection. In 
more complex SS-ANOVA models model selection amounts to the selection 
of main effects and interaction terms in the SS-ANOVA decomposition. The 
interaction terms reside in the tensor product spaces of univariate func- 
tion spaces. The reproducing kernel of a tensor product space is simply the 
product of the reproducing kernels of the individual spaces. This greatly 
facilitates the use of the smoothing spline-type method in such models. 

A common example of the function space of univariate functions is 
the Sobolev Hilbert space. The ^th-order Sobolev Hilbert space is: Si = 
{g : g, g' , . . . ,g^^~^^ are absolutely continuous, g^^^ e£2[0,1]}. Following [19], 
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we define the norm in by 

(3) llfff = E|/^('^)W^4'+ 

With this norm 5^ can be decomposed as the direct sum of two orthogonal 
subspaces 5^ = {1} ® 5^. The spaces Sn and Si are RKHS's and their repro- 
ducing kernels are given in [19]. The second-order Sobolev Hilbert space S2 is 
the most commonly used in practice and will be used in our implementation 
of the COSSO. 

2.2. The COSSO. In general, the function space in the SS-ANOVA can 
be written as 

p 

(4) J^ = {l}®Ti with Jc-^ = 0^", 

where J-^ , are p orthogonal subspaces of T . In the additive model 
p = d and the .F'^'s are the main effect spaces. In the two-way interaction 
model there are d main effect spaces and d{d — l)/2 two-way interaction 
spaces, thus p = d{d + l)/2. We may further decompose the functional com- 
ponents into parametric and nonparametric parts, as is commonly done with 
the smoothing spline method. We do not pursue this in this paper as our em- 
phasis is on the selection of functional components in SS-ANOVA. However, 
the general idea of our procedure can still be applied with this further de- 
composition, and it may be helpful to select parametric and nonparametric 
components of the variables. 

Denote the norm in the RKHS by || • ||. A traditional smoothing spline 
type method finds f £ J- to minimize 

1 n p 

(5) -E{y^-/(^0}' + AE^a'll^"/ll'. 

Tl . ^ i 

1=1 a=l 

where f is the orthogonal projection of / onto T"' and 9^ > 0. If 9a = 0, 
then the minimizer is taken to satisfy = 0. We use the convention 

0/0 = throughout this paper. The smoothing parameter A is confounded 
with the ^'s, but is usually included in the setup for computational purposes. 
We propose the COSSO procedure that finds / G JT to minimize 

(6) - E{y. - f{x^)? + rlJU) with J(/) = E , 

1=1 a=l 

where r„ is a smoothing parameter. We sometimes suppress the dependence 
of T on n in our notation. The penalty term J(/) in the COSSO is a sum 
of RKHS norms, instead of the squared RKHS norm penalty employed in 
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the smoothing sphne. The penahy J(/) is not a norm in T . However, it 
is a convex functional and is a pseudonorm in the sense: for any f^g in 
^, J(/) > 0, J(c/) = |c|J(/), J(/ < J(/) + J(5); for any nonconstant / 
in T ^ J{f) > 0. And we have that 

(7) <j^/)<pEiiP"/f. 

The existence of the COSSO estimate is guaranteed due to the convexity 
of (6), as stated in the following. 

Theorem 1. Let T he an RKHS of functions over an input space X. 
Assume that T can he decomposed as in (4). Then there exists a minimizer 
of (6) in 

2.3. Connection to the LASSO in linear models. In linear models the 
regression function is assumed to be f{x) = /3o + J2j=iPjX^''^ ■ Traditional 
approaches to variable selection include best subset selection and forward/ 
backward stepwise selection. As pointed out by [1], these methods suffer from 
instability and relative lack of accuracy. Several new and effective methods 
for variable selection in linear models have been proposed in recent years 
[1, 5, 6, 7, 15]. Two methods, the nonnegative garrote [1] and the LASSO [15], 
are related to the method in our paper, and are reviewed in the following. 

Let /3° = {(3q, . . . be the ordinary least squares estimates. The nonneg- 
ative garrote solution is (/^g, riP", . . . , r^P^), where (ri, . . . , r^) is the solution 
to 

n r d ^ 2 

' ' i=l I j=l ) 

d 

subject to Tj > 0, j = 1, . . . , d and E ^ ^• 

j=i 

Here t > is a tuning parameter. The nonnegative garrote selects a subset 
and shrinks the estimate at the same time. Breiman [1] showed that the non- 
negative garrote has consistently lower prediction error than subset selection 
with extensive simulation studies. 

The Least Absolute Shrinkage and Selection Operator (LASSO) estimate 
j3 = {Po, . . . , (3d) is the minimizer of 

-T.\yi-f^o-Y.f^J^?\ subject to El/^il^*' 
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or equivalently, the minimizer of 

^ i=i I j=i ) j=i 

where t and A are tuning parameters. The LASSO is a penaUzed least squares 
method with the Li penalty on the coefficients. 

The LASSO can be seen as a special case of the COSSO. For the input 
space = [0, 1]"', consider the linear function space = {1} © {x^^^ — 1 /2} © 
••• © {x^'^^ — 1/2}, with the usual L2 inner product on !F: {f,g) = J^fd- 
The penalty term in the COSSO (6) becomes J(/) = (12)-^/^ l/^il 
f{x) = Po + J2'j=i PjX^-^^ . This is equivalent to the Li norm on the linear co- 
efficients, leading to the LASSO estimator. Notice, however, in the COSSO 
interpretation the penalty is the sum of the norms of the function compo- 
nents, rather than the Li norm of the coefficients. 

2.4. Asymptotic property of the COSSO. In this section we assume a 
fixed design. Define y= (yi, . . . , With a little abuse of notation, let 
/ stand for both the regression function and its functional values at data 
points, that is, / = (/(xi), . . . , Define the norm || * ||^ and inner 
product (•,-)n in -R" as 

-in 1 

then \\y — /||^ = ^/nJ2i'=i{yi ~ fi^i)}^- The following theorem shows that 
the COSSO estimator in the additive model has a rate of convergence 
^-£/(2£+i) ^ where i is the order of smoothness of the components. 

Theorem 2. Consider the regression model yi = foixi) + £i, i = 1, . . . ,n, 
where the xis are given covariates in [0,1]'^, and the £iS are independent 
7V(0,cj2) noise. Assume fo lies in J" = {1} ® J^i , J'i=^j=iS^, with S^ = 
{1} © being the Ith-order Sobolev space with norm (3). Consider the 
COSSO estimate f as defined in (6). Then (i) if fo is not a constant, and 
r-i = 0,(n^/(2^+i))j{2^-i)/(4^+2)(^^)^ f^^^^ ||/-/o||n = Op(r„)Ji/2(/o); 

(ii) if fo is a constant, we have ||/ — /o||n = Op(max{n~^/(^^~-^Vn ^^''^^ ^\n~^/' 

3. An equivalent formulation. It can be shown that the solution to (6) 
is in a finite-dimensional space, and therefore the COSSO estimate can be 
computed directly from (6). 

Lemma 1 . Let f = h + Y^^a=i fa a minimizer of (6) in (A), with fa € 
J^" . Then G span{Ra{xi, = 1, . . . ,n}, where Ra{-, ■) is the reproducing 
kernel of T'^ . 



COMPONENT SELECTION AND SMOOTHING 



7 



However, it is possible to give an equivalent form of (6) that is easier to 
compute. Consider the problem of finding 6 = {9i,. . . ,9p)'^ and / G JF to 
minimize 

1 n p p 

-E{^*-/(^0}' + ^oE^«'ll^"/ll' + ^E^" subject to e^>0, 

i=l a=l a=l 

(8) 

a = l,...,p, 

where Aq is a constant and A is a smoothing parameter. The constant Aq can 
be fixed at any positive value and is included here only for computational 
considerations. 

Lemma 2. S'ei A = r^/(4Ao). (i) If f minimizes (6), setOa = Xl^"^ X~'^^'^ x 
then the pair {6, f) minimizes (8). (ii) On the other hand, if a pair 
{6,f) minimizes (8), then f minimizes (6). 

The form of (8) is very similar to the common smoothing spline (5) with 
multiple smoothing parameters, except that there is an additional penalty 
on the 0's. Differently from [23] where the sparsity in coefficients does not 
assure the sparsity in the functional components, for the COSSO procedure 
the sparsity of each component fj is controlled by a single parameter 6j. 
The additional penalty term on the 0's, X]j=i^i) shrinks them toward zero 
and hence makes the ^'s sparse, giving rise to zero function components in 
the COSSO estimate. 

We only need to tune A in the implementation of the COSSO, and Aq can 
be fixed at any positive value. In Section 6 we will see that an appropriate 
choice of Aq helps to put the tuning parameter on a natural scale and facil- 
itates the tuning of the COSSO. In contrast, the common smoothing spline 
has two sets of smoothing parameters A and 0's that are confounded. The 
common way to search for smoothing parameters iterates between A and the 
logo's, making it difficult to have zero components in the solution. 

4. A special case with the tensor product design. To illustrate the mech- 
anism of the COSSO for model selection, in this section we give an instructive 
analysis in the special case of a tensor product design with an SS-ANOVA 
model built from the second-order Sobolev spaces of periodic functions. We 
assume the e's in the regression model are independent with distribution 
A^(0, cr^). In a tensor product design the design points are 

3^12,2) • • • ) Xid,d) '■'i'k — ^1 ■ ■ ■ )^kjk = 1, . . . ,d}, 

where Xj^i^ = j/rik, j = 1, . . . , n^, k = 1, . . . ,d. Without loss of generality, we 
fix Ao = i in the COSSO (8) and focus on the case d = 2 with the SS-ANOVA 
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model being f{s, t) = b + /i(s) + f2{t) + fi2{s, t). We assume ni = 722 = m is 
an even integer. The sample size is then n = m^. 

The second-order Sobolev space of periodic functions can be written as 
r = {l}er, where 

{00 00 
f:f{t) = J2auV2cos27n^t + J2buV2sm27n^t, 
u=l u=l 

00 

with {al + 6^) (27rz^)^ <oo\. 
v=l J 

The norm in T is Hf/p = /o^{fl'"(i)}^ dt. When m is large, a good approximate 
subspace of T is Tm = {1} © with 

{m/2-l m/2-1 -x 

f:f{t)= ^ \/2 cos 27rz^t + ^ 6^\/2 sin 27rz^t + 0^/2 cos vrmi >. 
1^=1 i/=i J 

Wahba [19] used this subspace approximation to give a very instructive 
investigation of the filtering properties of the smoothing spline. Here we 
consider minimizing (8) in = ® = {1} © © © (f^ © f^), 
as the argument is instructive. The argument for the more general function 
space = © is similar but involves more technicality, and is deferred 
to the Appendix. 

In this case (8) becomes 

2 m m 



fc=l£=l 



^1 > 0,^2 > 0,^12 >0. 

Write 71 (t) = 1, 72i/(t) = \/2cos(27ri/t), j2u+i{t) = \/2sin(27rz^t) for u = 
1, . . . , m/2 — 1, and 7m(t) = cos(7rmt). Then any function in Tm can be writ- 
ten as g{t) = X^I^i o-ulu{t)^ and any function in can be written as 

m m 

(9) /(s,t) = 5]5]o^.7^(s)7.(t). 

/^=l 1/=! 

It is known that (see [19], page 23) 



m 
k=l 



1, if /i = 1/ = 1, . . . , m, 

0, if /i 7^ I/, /i, = 1, . . . , m. 
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Recall the inner product (•, •)„ of i?" defined in Section 2.4. Write ^fj,u{s, t) = 
7/i(s)7iy(^), and 7^1^ as the data vector corresponding to the function 7/i;/(s, t). 
From the above orthogonality relations and the tensor product design, we 
get 

\7m-i'7M2-2;„ |o, if^i/^2ori/i/z.2,/ii,i/i,/i2,i^2 = l,...,m. 

Therefore {7/^^, = l,...,m;i/ = 1,..., m} form an orthonormal basis in 
i?" with respect to the norm || • We then get from (9) that a^jy = (/, 'y^u)n- 
Write Zf,u = {y,'y^u)n- Then z^i, = a^u + where 5^^ ~ N{{),a'^/n) are 
independent. The COSSO problem can be written as 



(10) 



mm m m 

m m 

+ E E + ^(^1 + ^2 + ^12), 

^=2u=2 



with g^jy ~ ^^i/^ uniformly for fi ^ 1 or v ^ I, ^,iy = 1, ... ,m. Here ~ is read 
as "has the same order as." Therefore the minimizing a^jjy satisfies an = 
zii]afj,i = z^i6i{6i + for ji > 2;ai^ = ziy92{02 + qiuY^ , for > 2; 

a^v = ZfiuOuiOu + qfj.u)~^, for /U > 2, > 2; and (10) becomes 

{m ^ f ™ 1 

/i=2 J [u=2 J 

E E + ^12)"' + Aei2 . 

^=2 1/=2 J 

We see that the three components can be minimized separately. Let us con- 
centrate on 9i2, as 9i and 62 can be dealt with similarly. Let 

^(6*12) = E E + ^12)^^ + A^i2- 

Then ^'(6*12) = A - X;™=2 EIL2 QtivzlMi^v + 6'l2)~^ which increases as (9i2 > 
increases. Define U = ^"=2 ^17=2 V^-^^i^- If < A, then A'(0) > 0, ^'(^12) > 
for all 612 > 0, and the minimizing 9i2 of A is 0; otherwise the minimiz- 
ing 612 is larger than 0. Therefore we can see that the COSSO estimator 
selects components through a soft thresholding-type operation according to 
the magnitude of U. Notice 612 = implies /12 = 0. 

With the analysis above, we can now show that, when A — > and nX — > 00, 
the COSSO selects the correct model with probability tending to 1. Without 
loss of generality, let us concentrate on /12. 
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If /i2 = 0, then a^u = for any pair such that /i > 2 and > 2. 

So EiU) ~ aVnE;iL2EIL2/^-'^-' ~ n-V2,var(C/) = E;r=2 2^-^ x 
(j'^q'^y ~ n^^cr^. Therefore when nA — > 00, by Chebyshev's inequahty, 

pr([/ > A) < pr(|[/ - E{U)\ > A - < var(f/)/{A - ^(J7)}^ ^ 0. 

Therefore with probabihty tending to unity, C/ < A, and thus /12 = 0. 

On the other hand, if /12 / 0, then a^o^,yg / for some fJ.o'>2 and i/q > 2, 
and 

m m m m 

var(^) = ^ ^ V^var(z2 J = ^ ^ ^-2(4^-i„2^^2 + 2^"^^) 

;i=2 v=2 /i=2 1^=2 

{mm 
^=2v=2 ) 

= 4n"V2||/i2||i, + 2n-V = O(n-i). 
Therefore when A — > 0, by Chebyshev's inequahty, we get 

pr(f/ < A) < pr(|C/ - E{U)\ > E{U) - A) < var(C/)/{^(C/) - A}^ ^ 0. 
Therefore with probabihty tending to unity, U > X, and thus /12 7^ 0. 

5. Algorithm. For any fixed 0, the COSSO (8) is equivalent to the smooth- 
ing sphne (5). Therefore from the smoothing sphne hterature (e.g., [19]) it 
is weU known the solution / has the form f{x) = EiLi CiRe{xi,x) + 6, where 
c = (ci, . . . , Cn)^ G i?", b & R, and R0 = J2a=i ^aRa, with Ra being the re- 
producing kernel of With some abuse of notation, let Ra also stand for 
the nx n matrix Xj)}, i = 1, . . . ,n, j = 1, . . . ,n, let Rg also stand for 

the matrix Ea=i (^aRa, and let Ir be the column vector consisting of r Vs. 
Then we can write / = R0C + bin, and (8) can be expressed as 

^(y-^^e^RaC-bln] (y-'^OaRaC-blA 



(11) 



a=l / \ a=l 



P P 

+ XoJ2eaC^RaC + XY,&a, 
0=1 a=l 



where 9a>0, a = 1, . . . ,p. 

If the 0's are fixed, then (11) can be written as 

(12) min(y - Rgc - 61n)^(y - Rec - bin) + nXoc^Rgc. 

c,b 
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The solution to this smoothing sphne problem is given in [19]. 

On the other hand, if c and b are fixed, denote ga = RaC and let G be the 
n X p matrix with the ath column being ■ Simple calculation shows that 
the 9 = {ei,...,9p)'^ that minimizes (11) is the solution to 

p 

(13) inm{z-Ge)^{z-Ge) + nXy2Ga subject to 6'a > 0, a = 1, . . . ,p, 

where z = y — (l/2)nAoc — bin- 

Therefore a reasonable scheme would be to iterate between (12) and (13). 
In each iteration (11) is decreased. Notice that (13) is equivalent to 

p 

(14) inm{z-Gef{z-Ge) subject to 61^, > 0, a = 1, . . . ,p; V (9a < M, 

for some M > 0. We prefer to iterate between (12) and (14) for computa- 
tional considerations. 

Notice that the formulation (14) is exactly the problem in calculating the 
nonnegative garrote estimate. Therefore our algorithm iterates between the 
smoothing spline and the nonnegative garrote. The algorithm starts with 
a natural initial solution given by the smoothing spline, which is already a 
good estimate. By applying later iterations of our algorithm, we get what 
we view as an iterative improvement on the smoothing spline. A limited 
number of iterations is usually sufficient to achieve good performance in 
practical applications. This is in spirit similar to the basis pursuit algorithm 
in [2]. We observe empirically that the COSSO objective function decreases 
quickly in the first iteration, and the objective function after the first iter- 
ation is already very close to the objective function at convergence, as the 
magnitude of the decrease in the first iteration dominates the decreases in 
subsequent iterations. This motivates us to consider the following one-step 
update procedure: 

1. Initialization: Fix 9^ = 1, a = 1, . . . ,p. 

2. Solve for c and b with (12). 

3. For c and b obtained in step 2, solve for 9 with the nonnegative garrote 
(14). 

4. With the new 9, solve for c and b with the smoothing spline (12). 

This one-step update procedure has the flavor of the one-step maximum 
likelihood procedure in which a one-step Newton-Raphson algorithm is ap- 
plied to a good initial estimator and which is as efficient as fully iterated 
maximum likelihood. A discussion of the one-step procedure and the fully 
iterated procedure (in a different algorithm) can be found in [6]. In our ex- 
perience, the one-step update procedure and the fully iterated procedure 
have comparable estimation accuracy. 
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6. Choosing the tuning parameter. The generalized cross-validation pro- 
posed by Craven and Wahba [4] is one of the most popular methods for 
choosing smoothing parameters in the smoothing spline method. Let A be 
the smoothing matrix of the smoothing spline. That is, y = Ay. The gener- 
alized cross-validation estimate of the risk is 



{n-itr(/-^)}2' 

Tibshirani [15] proposed a GCV-type criterion for choosing the tuning 
parameter for the LASSO through a ridge estimate approximation. This 
approximation is particularly easy to understand in light of the form (8) 
for the linear model /(x) = Pq + J2'j=i PjX^-^"^: fix the 6j at their estimated 
values 6j, and calculate GCV for the corresponding ridge regression. This 
approximation ignores some variability in the estimation process. However, 
the simulation study in [15] suggests that it is a useful approximation. This 
motivates our GCV-type criterion: We use the GCV score for the smoothing 
spline in (8) when the ^'s are fixed at the solution. 

Another popular technique for choosing tuning parameters is fivefold or 
tenfold cross-validation. The computation load of GCV is smaller. We com- 
pare the performance of these two criteria in the COSSO with simulations. 
It is also possible to use the Cp criterion based on the concept of general- 
ized degrees of freedom [13, 22]. We do not consider this possibility since 
in our problem there is no explicit formula for the degrees of freedom and 
numerical evaluations tend to be computationally intensive. 

The following is the complete algorithm for the COSSO with adaptive 
tuning: 

1. Fix 6a = I, a = I, . . . ,p. Solve the smoothing spline problem, and tune Aq 
according to CV or GCV. Fix Aq at the chosen value in all later steps. 

2. For each fixed M in a reasonable range, apply the one-step COSSO algo- 
rithm with M. Choose the best M according to CV or GCV. The solution 
corresponding to this chosen M is the final solution. 

In our simulations it is noticed that once Aq is fixed according to step 1, 
the optimal M seems to be close to the number of important components. 
This helps to determine the range of tuning for M. 

7. Simulations. In this section we study the empirical performance of 
the COSSO estimate in terms of estimation accuracy and model selec- 
tion. We compare the COSSO with GCV, with fivefold cross-validation, 
and with MARS, which is a popular stepwise forward-backward proce- 
dure for building functional ANOVA models. The measure of accuracy is 
the integrated squared error ISE = Ex{fiX) — f{X)}'^, which is estimated 
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by Monte Carlo integration using 10,000 test points from the same distri- 
bution as the training points. We run each simulation example 100 times 
and average. The matlab code for the COSSO is available from webpages 
of the authors (www.stat.wisc.edu/~yilin or www4.stat.ncsu.edu/~hzhang). 
The MARS simulations are done in R, with the function "mars" in the 
"mda" library contributed by Trevor Hastie and Robert Tibshirani. 

The following four functions on [0, 1] are used as building blocks of regres- 
sion functions in some of the simulations: gi{t) = t; g2{t) = {2t — 1)^; 53 (t) = 
2-s"in(2Sf) ' £'4(i) = 0.1 sin(27rt) -hO.2 cos(27rt) -hO.3 sm'^{2-Kt) + 0Acos^{2-Kt) + 
0.5sin^(27rt). Consider two covariance structures of the input vector X, with 
varying correlation: 

Compound symmetry. Let X^^^ = {Wj + tU)/{l + t), j = l,...,d, 

where Wi, . . . , Wd and U are i.i.d. from Uniform(0, 1). Therefore corr(X^-'\ 
X^*')) = t^/(l + t^) for j k. The uniform design corresponds to the 
case t = 0. 

{Trimmed) AR(1). Let Wi,...,Wd be i.i.d. N{0, 1), and X^^) = Wi, X^^'^ = 
pX(^-^^ + {l-p'^)^/^Wj,j = 2,...,d. TnmX^i^ in [-2.5, 2.5] and scale 
to [0, 1]. 

Example 1. Consider a simple additive model in R^^, with the under- 
lying regression function f{x) = 5gi{x^^^) + 352(2^^^^) + 4:gs{x^^^) + 6g4{x^^^). 
Therefore X^^\ . . . , X^^°^ are uninformative. We consider the sample size 
n = 100. Generate y = f{x) + e, where e is distributed as A^(0, 1.74). The 
standard deviation of the noise was chosen to give a signal-to- noise ratio 3:1 
in the uniform case. For comparison, the variances of the component func- 
tions are vai{5gi{XW)} = 2.08, var{352(^^^^)} = 0.80, var{453(X(3))} = 
3.30 and var{654(xW)} = 9.45. 

We apply the COSSO with additive models (the additive COSSO) to 
the simulated data. Therefore there are ten functional components in the 
model. Figure 1 shows how the magnitudes of the estimated components 
change with the tuning parameter M in one run. The magnitudes of the 
functional components are measured by their empirical Li norms, defined 
as l/nJ2i=i \fj{4^'^)\ for j = 1, . . . ,d. The Aq in this run is fixed at 9.7656 x 
10~^. Both GCV and fivefold cross-validation choose M = 3.5, giving a 
model of five terms in this run. The estimated function components are 
plotted along with the true function components in Figure 2. Notice the 
components are centered according to the ANOVA decomposition. 

For each setting of covariance structure, we run the simulation 100 times 
and average. The resulting average integrated squared error and its asso- 
ciated standard error (in parentheses) are given in Table 1. Also included 
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in the table is the average integrated squared error of MARS for additive 
models. We can see the two COSSO procedures perform better than MARS 
in all the settings studied. To study the performance of the COSSO in terms 
of model selection, we determine in the uniform case the number of times 
each variable appears in the 100 chosen models (Table 2), and the number 
of terms in the 100 chosen models (Table 3). In our calculation we take 6 to 
be zero if it is smaller than 10"^. The COSSO with fivefold cross-validation 



3 




The tuning parameter M 

Fig. 1. The empirical Li norm of the estimated components as plotted against the tuning 
parameter M in one run of Example 1 . 



Table 1 

Comparison of average integrated squared errors for Example 1 







Comp. symm 






AR(1) 






t = 


t = l 


t = 3 


p= -0.5 


p = 


p = 0.5 


COSSO(GCV) 
COSSO(5CV) 
MARS 


0.93 (0.05) 
0.80 (0.03) 
1.57 (0.07) 


0.92 (0.04) 
0.97 (0.05) 
1.24 (0.06) 


0.97 (0.07) 
1.07 (0.06) 
1.30 (0.06) 


0.94 (0.05) 
1.03 (0.06) 
1.32 (0.07) 


1.04 (0.07) 
1.03 (0.06) 
1.34 (0.07) 


0.98 (0.07) 
0.98 (0.05) 
1.36 (0.08) 
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0.5 1 0.5 1 0.5 1 




0.5 1 0.5 1 



«4 ^ 

Fig. 2. The estimated component (dashed line) and true component (solid line) functions 
in one run of Example 1. Shown are the components for variables 1,2,3,4 and 7. For the 
other variables, both true and estimated components are zero. 

misses the second variable six times, but chooses the correct four- variable 
model 84 times. The COSSO with GCV and MARS do not miss any im- 
portant variable, but tend to include uninformative variables in the chosen 
models. The COSSO with GCV chooses the correct four-variable model 57 
times, while MARS does so only four times. 

Table 4 gives the mean and standard deviation of the model sizes chosen 
by the methods in various settings. The settings considered are compound 
symmetry with t = 1 and 3 and trimmed AR(1) with p = —0.5, and 0.5. 
The average model size chosen by the COSSO with fivefold cross-validation 
is close to 4, the size of the true model. The COSSO with GCV selects 
slightly larger models. The models chosen by MARS are even larger. 

Example 2. Consider a larger model with d = 60 and the regression 
function 
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Table 2 

Appearance frequency of the variables in the models 
in the uniform setting 











Variable 












1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


COSSO(GCV) 


100 


100 


100 


100 


14 


11 


18 


15 


11 


13 


COSSO(5CV) 


100 


94 


100 


100 


1 


1 


3 


2 


4 


2 


MARS 


100 


100 


100 


100 


35 


35 


34 


39 


28 


35 



+ 1.551 + 1.5g2(x(*^)) + 1.553(x(^)) + 1.5g4(x(^)) 
+ 2gi(x(9)) + 2<72(x(^°)) + 2<73(x(i^)) + 2g^{x^'^'>). 

Therefore there are 48 uninformative variables. Let n = 500. The variance 
of the normal noise is 0.5184, to give a signal-to-noise ratio of 3:1 in the 
uniform case. For comparison, in the uniform setting vaT{gi{X^^^)} = 0.08, 
var{52(^^^^)} = 0.09, var{53(X(3))} = 0.21 and var{c/4(X(^))} = 0.26. Both 
COSSO and MARS are run 100 times with additive models. The results are 
summarized in Table 5. We see that the two COSSO procedures outperform 
MARS, with the COSSO with fivefold cross-validation doing slightly better 
than the COSSO with GOV. 



Table 3 

Frequency of the size of the models in the uniform setting 











Model 


size 










3 


4 


5 


6 


7 


8 


9 


10 


Mean 


COSSO(GCV) 





57 


17 


18 


5 


2 





1 


4.82 


COSSO(5CV) 


6 


84 


7 


3 














4.07 


MARS 





4 


24 


40 


26 


6 








6.06 



Table 4 

Mean and standard deviation of the model sizes 





Comp. symm. 




AR(1) 




t = l 


t = 3 


p = -0.5 


p = 


p = 0.5 


COSSO(GCV) 


4.8 (1.2) 


4.8 (1.5) 


4.7 (1.2) 


4.8 (1.3) 


4.6 (1.2) 


COSSO(5CV) 


4.1 (1.2) 


4.4 (1.9) 


4.1 (1.2) 


4.0 (1.0) 


3.8 (0.9) 


MARS 


6.3 (0.9) 


6.2 (0.9) 


6.1 (1.0) 


6.1 (0.8) 


5.9 (0.8) 
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Example 3. We consider a 10-dimensional regression problem with sev- 
eral two-way interactions: 

/(X) =5l(x«) +52(x(^)) +53(x(3)) +ff4(x(')) 

+ 5i(x(3)xW) + g, ( "^'^+"^'^ ) + <73(x(i)x(2)). 

We consider the uniform setting and set the noise to be normal with stan- 
dard deviation 0.2546, to give a signal-to-noise ratio of 3:1. The average inte- 
grated squared errors are given in Table 6 for sample sizes n = 100, 200, 400. 
Both the COSSO and MARS are run with the two-way interaction model. 
We follow the advice in [8] to set the cost for each basis function optimization 
to be 3 in the MARS for two-way interaction models. 

There are 55 function components in the COSSO. The COSSO does not 
do well when n = 100. It seems that there are too many function components 
for the COSSO to select from with 100 data points. MARS does not suffer 
from a small sample size so much as the COSSO. Part of the reason is that 
the MARS algorithm introduces a certain hierarchical order of the terms 
being searched from: only after a univariate basis function is included in 
the model will the product of other terms with it become a candidate for 
inclusion in later steps. In contrast, the COSSO selects from all the function 
components and does not distinguish between main effects and interaction 



Table 5 

Average ISE (unit lO^^J and model sizes with their standard errors 





Comp. symm. 


AR(1) 




t = 


t = 1 


p = 0.5 


p= 0.5 


ISE COSSO(GCV) 


201 (4) 


178 (5) 


199 (6) 


183 (5) 


COSSO(5CV) 


144 (4) 


162 (5) 


153 (4) 


149 (5) 


MARS 


353 (7) 


302 (7) 


286 (6) 


280 (5) 


Model sizes COSSO(GCV) 


18.0 (4.1) 


18.0 (4.1) 


19.0 (5.1) 


18.0 (4.3) 


COSSO(5CV) 


12.0 (0.2) 


11.7 (1.4) 


12.1 (1.4) 


11.9 (1.0) 


MARS 


35.2 (2.3) 


36.1 (2.1) 


35.2 (2.5) 


35.9 (2.4) 




Table 6 






Average integrated squared errors for Example 3 






n = 100 


n = 200 


n = 400 




COSSO(GCV) 0, 


.358 (0.009) 


0.100 (0.003) 


0.045 (0.001) 




COSSO(5CV) 0, 


.378 (0.005) 


0.094 (0.004) 


0.043 (0.001) 




MARS 0, 


.239 (0.008) 


0.109 (0.003) 


0.084 (0.001) 
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terms. Therefore the COSSO does not assume any hierarchical structure, 
and may not be efficient when the true model is hierarchical and the sample 
size is small. However, as the sample size increases, the COSSO procedures 
catch up quickly. Their performance is comparable to MARS when n = 200 
and better than MARS when n = 400. 

In the above examples we see that in general the COSSO with fivefold 
cross-validation tends to do better than the COSSO with GCV. We therefore 
recommend the use of fivefold cross-validation with the COSSO unless the 
computation time is a crucial factor. In the following examples the COSSO 
is tuned with fivefold cross-validation. 

Example 4 (The circuit example). This is an example from [8]. Of 
interest is the dependence of the impedance Z of a circuit and phase shift 
(f) on components in the circuit. The true dependence is described by 

Z=[i?2 + {cuL-l/(c^C)}2]V2, 

The input variables are uniform in the range < R< 100 , 407r <uj< 5607r, 
0<L<1, 1<C<11, and the noise is normal with the standard deviation 
set to give a signal-to- noise level 3:1. 

This is a relatively small problem with d = 4. All orders of interactions 
are present. Friedman [8] applied MARS with the additive model, the two- 
way interaction model and the saturated model to this example, and found 
that the performance of the two-way interaction model was the best. We 
scale the input region to [0, 1]^ and apply the COSSO with fivefold cross- 
validation. With the small dimension, it is possible to apply the COSSO with 
the saturated model, which has 2^ — 1 = 15 function components. However, it 
turns out that the two-way interaction COSSO does slightly better than the 
saturated model. We compare the integrated squared error of the two-way 
interaction COSSO and that of the two-way interaction MARS in Table 7. 
It turns out that the COSSO performs much better than MARS. We note 
that in Table 7 the numbers reported for MARS when re = 200 are calculated 
after excluding one large extreme outlier in ISE, and those for MARS when 
re = 400 are calculated after excluding three large extreme outliers in ISE. 

8. Real examples. We now apply the COSSO to three real datasets: 
Ozone data, Boston housing data and Tecator data. The first two datasets 
are available from the R library "mlbench." In the Ozone data, the daily 
maximum one-hour-average ozone reading and eight meteorological variables 
were recorded in the Los Angeles basin for 330 days in 1976. The Boston 
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Table 7 

Average integrated squared error for estimating the impedance 
Z (in units of 10^) and the phase shift cj) (in units of 10~^^ 





n = 100 


n - 


200 


n - 


= 400 


for Z COSSO 


1.91 (0.12) 


0.85 


(0.05) 


0.51 


(0.03) 


MARS 


5.57 (0.41) 


2.47 


(0.16) 


1.37 


(0.08) 


for 4, COSSO 


12.98 (0.36) 


7.96 


(0.20) 


5.36 


(0.10) 


MARS 


20.59 (0.96) 


12.60 


(0.71) 


8.19 


(0.14) 



Table 8 

Estimated prediction squared errors and their standard 



errors 


Ozone 


Boston 


Tecator 


COSSO 16.04 (0.06) 


9.89 (0.08) 


0.92 (0.02) 


MARS 18.24 (0.45) 


14.31 (0.34) 


4.99 (1.07) 



housing data concerns housing values in the suburbs of Boston. There are 
12 input variables. The sample size is 506. The Tecator data is available 
from the datasets archive of StatLib at lib.stat.cmu.edu/datasets/. The data 
was recorded on a Tecator Infratec Food and Feed Analyzer working in the 
wavelength range 850-1050 nm by the Near Infrared Transmission (NIT) 
principle. Each sample contains finely chopped pure meat with different 
moisture, fat and protein content. The input vector consists of a 100-channel 
spectrum of absorbances. The absorbance is — log^o of the transmittance 
measured by the spectrometer. As suggested in the document, we use the 
first 13 principal components to predict the fat content. The total sample 
size is 215. 

We apply the COSSO and MARS on these datasets, and estimate the pre- 
diction squared errors E[{Y — f{X)}'^] by tenfold cross-validation. We select 
the tuning parameter by fivefold CV within the training set. The estimate 
obtained is then evaluated on the test set. We do this tenfold cross-validation 
five times and then average. For all three datasets, both the COSSO and 
MARS find the two-way interaction model has better prediction accuracy 
than the additive model. Therefore we choose to apply the two-way inter- 
action model and the results are given in Table 8. We can see the COSSO 
does considerably better than MARS. 

The COSSO algorithm can be naturally extended to tackle the nonpara- 
metric classification problem where the response y is either or 1. In this 
case we replace the square loss in (6) by the logistic regression loss function. 
The first term in (6) is then replaced by i{f) = l/nJ2f=i[—yifixi) + log(l + 
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g/(^i))]^ where / is the logit function. We can then solve the optimization 
problem by applying the quadratic approximation to i{f) iteratively. This 
leads to the iteratively reweighted least squares (IRLS) procedure, which is 
equivalent to a Newton-Raphson algorithm. Using this approach, we can 
solve our optimization problem by iterative application of the COSSO algo- 
rithm, within an IRLS loop. We illustrate the performance of this procedure 
by comparing it with a number of machine learning algorithms on several 
high-dimensional real datasets. 

van Gestel et al. [18] conducted a benchmark study comparing a number 
of commonly used machine learning techniques including the support vector 
machine (SVM), least squares SVM (LS-SVM), linear discriminant analysis 
(LDA), quadratic discriminant analysis (QDA), logistic regression (Logit), 
the decision tree algorithm C4.5, Holte's one-rule classifier (oneR), instance- 
based learners (IB) and the Naive Bayes method. There are five binary clas- 
sification datasets with continuous predictors, and we test the performance 
of the COSSO on these datasets. The datasets are the BUPA Liver Disor- 
der data, the Johns Hopkins University Ionosphere data, the PIMA Indian 
Diabetes data, the Sonar, Mines vs. Rocks data and the Wisconsin Breast 
Cancer data. The basic features of the datasets and the performance of dif- 
ferent algorithms are summarized in Table 9. Due to the high dimension of 
these problems, we only consider the COSSO additive model. 

Following [18], for each dataset we randomly select 2/3 of the data for 
training and tuning, and test on the remaining 1/3 of the data. We do 
this randomization ten times and report the average test set performance 
and sample standard deviation for the COSSO. The results for the other 
algorithms are taken from [18]. van Gestel et al. [18] included six types of 
LS-SVM's and found that the LS-SVM with the radial basis function (RBF) 
kernel performs best overall. To save space we only include the LS-SVM with 
RBF kernel and linear kernel in Table 9. van Gestel et al. [18] included two 
instance-based learners (IBl and IBIO in [18]). We combine them and report 
the better performance of the two. The same is done for the two Naive Bayes 
methods considered in [18]. The best average test set performance is denoted 
in boldface for each dataset in Table 9. We can see the COSSO gives very 
competitive performance on these benchmark datasets. 

9. Discussion. The difference between the COSSO and the common smooth- 
ing spline in smoothing spline ANOVA mirrors that between the LASSO and 
ridge regression in linear models. Compared with other model selection al- 
gorithms based on greedy search, the COSSO optimizes a global criterion 
and provides a shrinkage estimate. We have shown that the COSSO has 
attractive properties for model selection and estimation. 

One future research topic is statistical inference based on the COSSO. 
Traditionally inference after model selection is based on the selected models. 
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Table 9 

Comparison of the test set performance of the COSSO with the performance of SVM, 
LS-SVM, LDA, QDA, Logit, C4.5, oneR, IB, Naive Bayes and Majority Rule 





BUPA 


Ionosphere 


Pima Indian 


Sonar MR 


Wise. BC 


n 


345 


351 


768 


208 


683 


d 


e 




33 


8 


60 


9 


COSSO 


71.1 


(3.5) 


91.1 


(3.7) 


77.3 (2.2) 


79.0 


(4.5) 


97.0 (0.8) 


SVM (linear) 


67.7 


(2.6) 


87.1 


(3.4) 


77.0 (2.4) 


74.1 


(4.2) 


96.3 (1.0) 


SVM (RBF) 


70.4 


(3.2) 


95.4 


(1.7) 


77.3 (2.2) 


75.0 


(6.6) 


96.4 (1.0) 


LS-SVM (linear) 


65.6 


(3.2) 


87.9 


(2.0) 


76.8 (1.8) 


72.6 


(3.7) 


95.8 (1.0) 


LS-SVM (RBF) 


70.2 


(4.1) 


96.0 


(2.1) 


76.8 (1.7) 


73.1 


(4.2) 


96.4 (1.0) 


LDA 


65.4 


(3.2) 


87.1 


(2.3) 


76.7 (2.0) 


67.9 


(4.9) 


95.6 (1.1) 


QDA 


62.2 


(3.6) 


90.6 


(2.2) 


74.2 (3.3) 


53.6 


(7.4) 


94.5 (0.6) 


Logit 


66.3 


(3.1) 


86.2 


(3.5) 


77.2 (1.8) 


68.4 


(5.2) 


96.1 (1.0) 


C4.5 


63.1 


(3.8) 


90.6 


(2.2) 


73.5 (3.0) 


72.1 


(2.5) 


94.7 (1.0) 


oneR 


56.3 


(4.4) 


83.6 


(4.8) 


71.3 (2.7) 


62.6 


(5.5) 


91.8 (1.4) 


IB 


61.3 


(6.2) 


87.2 


(2.8) 


73.6 (2.4) 


77.7 


(4.4) 


96.4 (1.2) 


Naive Bayes 


63.7 


(4.5) 


92.1 


(2.5) 


75.5 (1.7) 


71.6 


(3.5) 


97.1 (0.9) 


Majority Rule 


56.5 


(3.1) 


64.4 


(2.9) 


66.8 (2.1) 


54.4 


(4.7) 


66.2 (2.4) 



resulting in biased inference. Shen, Huang and Ye [12] proposed a method 
to make approximately unbiased inference. It is of interest to see if their 
method can be adapted to our problem to give unbiased inference. 

APPENDIX 

Proofs. 

Proof of Theorem l. Denote the functional to be minimized in (6) 
by A{f); then A{f) is convex and continuous. Without loss of generality, we 
assume r = 1 . 

By (7) we have that J(/) > ||/||, for any f £ J^i. Let Rj^^ be the re- 
producing kernel of and let {•,-):Fi be the inner product in Ti. Denote 

1 /2 

a = maxf^i R^^ {xi,Xi). By the definition of a reproducing kernel, we have 
for any f £ J^i and z = 1, . . . , n, 

\f{xi)\=\{f{.),RrAx^,■))^,\<\\f\\{R^A^i,■),R^Axi,■))'J' 

(A.l) 

= ||/||i?X'(x„Xi)<a||/||<aJ(/). 
Denote p = Tnaeoc^^^d/f + \yi\ + 1). Consider the set 
D = {feJ^:f = b + h, 

with b e {!}, /i G .Fi, J(/) < p, \b\ < p"'^ + (a + l)p}. 
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Then D is a closed, convex and bounded set. Therefore by Theorem 4 of 
[14], page 162, there exists a minimizer of (6) in D. Denote the minimizer 
by /. Then A{f) < A{0) < p. 

On the other hand, for any f with J(/) > p, clearly > J(/) > p; 
for any f^T with J(/) <p, f = b + fi,b£ {1}, /i G ^ and |6| > p^/^ + (o + 
l)p, we use (A.l) to get that, for any i = 1, . . . ,n, 

\b + fiixi)-yi\ > [p^/^ + {a + l)p]-ap-p = p^/^. 

Therefore A{f) > p. For any f ^ D, A(f) > A(f), that is, / is a minimizer 
of (6) in J^. □ 

Proof of Theorem 2. For any / G JF, we can write f{x) = c+/i(x(^)) + 
• • • + = c + g{x), such that Er=i fji4'^) = 0, j = 1, . . . , d. Similarly, 

write /o(x) = cq + /oi(x(^)) H h fod{x^'^^) = cq + 50 (a;) and /(x) = c + 

+ - • • + = c + g{x). By construction E^=l{ffo(2;^) -^(a^i)} = 0, 

and we can write (6) as 

2 ^ \ ^ 

(co - cf + -(co - c) V ei + - V{5o(2;j) + - ^(a^i)}^ + J (5)- 
n ^-^^ n 

1=1 1=1 

Therefore, the minimizing c must minimize (cq — c)^ + 2/n(co — c) X]r=i 
That is, c = Co + l/nE"^;^ej. Therefore (c — cq)^ converges with rate . 
On the other hand, g must minimize 

1 " 

-Y^{m{xi)^ei- g(xi)\'^ ^TlJ{g). 
1=1 

Let g = {<7G^:5(a:) = /i(x«) + --- + /d(xW), with ^^i = 0, j = 
1, . . . , d}. Then go g £ G- The conclusion of Theorem 2 then follows from 
Theorem 10.2 of [17] and the following lemma. □ 

Lemma A.l. Let Hoo{S,Q) be the 6-entropy of Q for the supremum 
norm. Then 

HooiS, {g€g: J{g) < 1}) < ^d^^+^^/^r 
for all 6 > 0, n > 1 and some ^4 > not depending on 5, n or d. 

Proof. Define as the set of univariate functions of x*--'^ 
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where S(, is the ^th-order Sobolev space. Then from (3), any h£ satisfies 

e-2 1 
(A.2) y[h^''\l)-h^''\0)f + {h^^\t)fdt<l. 

We first show that for any h£G^ , we have \h\oo = {sup^gjg,!] I^C'^)!} 1^ 1- 
For any fixed h , define K = {k:0<k<i — 1, and for any integer q G 

[0, k], there exists G [0, 1] satisfying h^'^^ag) = 0}. Since ^"=1 Kx'f^) = 0, 
we have £ K. Let ko be the largest number in K. Now we consider the two 
possibilities k^^ i — 1 and kQ = i — 1 separately. 

If feo / £ - 1, then ko<e-2, ko G K and ko + l^K. By the definition 
of K, we see that h^^''^ is monotone and crosses the x-axis. So for any 
s E [0,1], we have \h^''°\s)\ < |/i('^'o)(l) - h^''»\0)\ < 1. The last inequality 
follows from (A.2). From this and the fact that h^^o~^) crosses the x-axis, 
we get \h'^^°~^\s)\ < 1, Vs G [0,1]. Continuing with this argument, we get 

|/i|oo<l. 

On the other hand, if /cq = ^ — 1, then max^ h^^-^\s) > 0, and 
mins/i(^~^)(s) < 0- % (A.2) we get 

1 > j\h^^\t)fdt > [J\h^^\t)\ dtj'^ > |max/i(^~i)(s) - mm h^^-^\s)^\ 

Therefore -1 < min^ /i(^~i)(s) < < max^ h^'^-'^\s) < 1. That is, \ h^^-'^\s)\ < 
1, Vs G [0, 1]. Now by the definition of K and that i — 1 € K, we know that 
crosses the x-axis for any integer k £ [0, ^ — 1]. Therefore we get |/i|oo < 1- 
Therefore we have shown that |/i|oo ^ 1 for any h € . It then follows 
from Theorem 2.4 of [17], page 19, that 

(A.3) Hoo{d,g^)<A6''/' 

for all (5 > and n > 1, and some positive A not depending on 6 and n. 

By the definition of G and the , we see that in terms of the supreme 
norm, if each , j = 1, . . . ,d, can be covered by N balls of radius 6, then the 
set {g G '■ J{g) < 1} can be covered by A^'^ balls with radius d5. By (A.3) 
we get 

Hoo{d5, {geG: J(g) < 1}) < Ad6-'/', 
and the conclusion of the lemma follows. □ 

Proof of Lemma 1. For any / G JF, we can write / = b + J2a=i fa with 
fa G .F°- Let the projection of fa onto span{i?Q,(xj, •), i = 1, . . . , n} C be 
ga and its orthogonal complement be ha- Then fa = ga + ha, and = 
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Ibaf + 11^ 

oi\\ , a — 1, . . . ,p. Since R — 1 + X/a=i is the reproducing kernel 
of J^, we have, making use of the orthogonal structures, 

IP p \ ^ 

f{Xi) = ( 1 + ^Ra{Xu-).b+ J2(9a + ha)) = b+ ^ (Xj , •) , Sa) , 
\ a=l a=l I a=l 

where (•, •) is the inner product in T. Therefore (6) can be written as 
1 n r p ^ 2 p 

- E - ^ - E (^"(^- ■)^9a) + r^Y.^\\ga\? + \\K\?f'\ 

Therefore any minimizing / satisfies /iq, = 0, a = 1, . . . ,p, and the conclusion 
of the lemma follows. □ 

Proof of Lemma 2. Denote the functional in (6) by A{f) and the func- 
tional in (8) by B{9,f). We have Ao^-^HP^/f + > 2AJ/^A1/2||P"/|| = 
^2 II pay for any 9a > and f € J^, and equality holds if and only if 

= Ap/^A-V^iipajii^ Therefore B{e,f) > A{f) for any e„ > and / G J^, 

and equality holds if and only if 6a = Ay^A~"^/^||P"/||, a = 1, . . . ,p. The 
conclusion of the lemma follows. □ 

Further derivations in the tensor product design case. Now we consider 
the function space = T-*^ (g)T^. Define S = {K{xi^i,Xj^i)}mxm, the marginal 
kernel matrix corresponding to the reproducing kernel of T. With a little 
abuse of notation, let Rj, j = 1,2, also stand for the n x n matrix of the 
reproducing kernel Rj evaluated at the n data points, and the same for 
Ri2. Suppose the data points are permuted appropriately; we have R12 = 
S S, where stands for the Kronecker product of matrices. Let Im be 
the column vector consisting of m I's. For the main effect spaces, we have 
i?i = S (l^l^) and R2 = (Iml^) ® S. 

Straightforward calculation gives Sl^ = mtlm, where t = l/(720m^). 
Let {^1 = lm.,(,2, ■ ■ ■ ,S,m} be an orthonormal (with respect to the inner 
product (•,-)m in R^) eigensystem of S, with corresponding eigenvalues 
mr]i,mr]2, ■ ■ ■ ,mr]m, where r]i = t, and 772 > Vm- Then it is well 

known that rji ~ for i > 2. See [16]. Notice Ci^^2, ■ ■ ■ ,Cm are also the eigen- 
vectors of Imlm) with corresponding eigenvalues being m,0, ...,0. Write 
C^iu = ^ Cu- It is then easy to check that {^^jy : /i, = 1, . . . , m} form an 
eigensystem of Ri, R2 and Ru- The eigenvalues of Ri, R2 and R12 are, 
respectively, 

ri,^,i = nr]^; n^^j, = for^>l,zy>2; 
'^2,11' = ^Vu, r2,^u = for /i > 2, 1/ > 1; 
^12,^1^ = n7]^r]^ for /i > 1, 1/ > 1. 
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It is clear that {^/^i/ : /i, = 1, . . . , m} is also an orthonormal basis in i?" 
with respect to the inner product (•, •)„. Consider the vector of length n of 
function values at the sample points: / = (/(a^fc.ii 2:^,2) : A;, £ = 1, . . . , m)^ . Let 

be the n x n matrix with columns being the vectors /U, 1/ = 1, . . . , m. 
Then O^O = nl. Denote a = [oay : n,u = 1, . . . , m)'^ = (l/n)0^/ and z = 
{z^^:^,u = 1, . . .,m)^ = {\/n)0^y. That is, 

(^fiv ~ {fjCiJ.iy)m Z^i, = {y,£,fj,i/)n'-i 

then / G R"" can be expanded in terms of the orthonormal basis, 
/ = H af,uC^lu = h + h + h + fi2, 

fJ,,U 

where /o = an^n, /i = J2'^=2a^ilC^^l, h = YJ^=2(^\vi\v and /12 = 
Ylll=2ni)=2(^i>.vii>.v Then all the components /o, /i, h and /12 are orthog- 
onal in EP" . Furthermore, we have z^^ = a^j^^ + S^j^u, where 5^u ^ N{Q,a'^ /n), 
for /i > 1, z> > 1. 

Now let us consider the COSSO estimate (11), 

1 ^ 

— {y — R0C — bln)^{y — Rec — blri)+c^Rec+\'y^Oa subject to 6*^ > 0, 

n 

Q = l 

a = l,...,p, 

where Re = Y.l=i^aRa- Let s = 0"^c, = {I / n^)0'^ R^O . Then D^, is a 
diagonal matrix with diagonal elements r^t^^u/n, a = 1, 2 or 12. The COSSO 
problem can be written as 

p 

{z - Des - (6, 0, ... , 0)T)^(z - Des - (6, 0, ... , 0)^) + s^Des + A ^ 

a=l 

where Dq = Y^^a=i ^aDa- It can then be shown by straightforward calculation 
that, for the minimizing s, b and 6, su =0, b = z\\ and s and Q minimize 

ii>2 

+ \Wv - Vu{02 + 0l2t)siu}^ + r/^(^2 + 0i2t)sl^] 
u>2 

+ Y + ei2'n^'nusl,y\ + \{9i + 92 + 0i2). 

H>2,v>2 

Therefore, at the minimum we have 

s^i = {i + r]^[ei + ei2t)]-^z^i, ^l>2■ 

siu = {l+riu{02 + ei2t)}-^zxy, u>2- 
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and the 6's minimize 

A{9i,e2, 012) = J2 + viiOi + VfiOi2ty^ + E ^1.(1 + ^-^2 + vM)-^ 

^t>2 i/>2 

+ E ^'.(l + ^12W)~'+A(01+e2 + ^12), 
^>2,j/>2 

subject to 01 >0,02> 0, ^12 > 0. A calculation similar to that in Section 4 
then shows that, when A is appropriately chosen, the COSSO selects the 
correct components with probability tending to 1. 
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